Splitting matters: how monotone transformation of predictor variables may improve the predictions of decision tree models

نویسندگان

  • Tal Galili
  • Isaac Meilijson
چکیده

It is widely believed that the prediction accuracy of decision tree models is invariant under any strictly monotone transformation of the individual predictor variables. However, this statement may be false when predicting new observations with values that were not seen in the training-set and are close to the location of the split point of a tree rule. The sensitivity of the prediction error to the split point interpolation is high when the split point of the tree is estimated based on very few observations, reaching 9% misclassification error when only 10 observations are used for constructing a split, and shrinking to 1% when relying on 100 observations. This study compares the performance of alternative methods for split point interpolation and concludes that the best choice is taking the mid-point between the two closest points to the split point of the tree. Furthermore, if the (continuous) distribution of the predictor variable is known, then using its probability integral for transforming the variable (”quantile transformation”) will reduce the model’s interpolation error by up to about a half on average. Accordingly, this study provides guidelines for both developers and users of decision tree models (including bagging and random forest).

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparison of Gestational Diabetes Prediction Between Logistic Regression, Discriminant Analysis, Decision Tree and Artificial Neural Network Models

Background and Objectives: Gestational Diabetes Mellitus (GDM) is the most common metabolic disorder in pregnancy. In case of early detection, some of its complications can be prevented. The aim of this study was to investigate early prediction of GDM by logistic regression (LR), discriminant analysis (DA), decision tree (DT) and perceptron artificial neural network (ANN) and to compare these m...

متن کامل

Ranking stocks of listed companies on Tehran stock exchange using a hybrid model of decision tree and logistic regression

Much research has introduced linear or nonlinear models using statistical models and machine learning tools in artificial intelligence to estimate Iran's rate of return. The primary purpose of these methods is simultaneously use different independent variables to improve stock return rates' modeling. However, in predicting the rate of return, in addition to the modeling method, the degree of co...

متن کامل

Comparison of gestational diabetes prediction with artificial neural network and decision tree models

Background: Gestational diabetes mellitus (GDM) is one of the most common metabolic disorders in pregnancy, which is associated with serious complications. In the event of early diagnosis of this disease, some of the maternal and fetal complications can be prevented. The aim of this study was to early predict gestational diabetes mellitus by two statistical models including artificial neural ne...

متن کامل

Identification of the most important factors of ethnic differences in anthropometric dimensions of Iranian workers using the decision tree

Background and aims: Anthropometry is the branch of human science that considers the physical measurement of the human body, especially size and shape. One application of anthropometrical data in ergonomics is the design of working space and the development of industrialized products. So that the tools, equipment and workstations, which designed based on the physical dimensions of the workers, ...

متن کامل

Comparison of disability score estimation in multiple sclerosis patients with artificial neural network and decision tree models

Background: Multiple Sclerosis (MS) is one of the most debilitating disease among young adults. Understanding the disability score (Expanded Disability Status Scale (EDSS)) of these patients is helpful in choosing their treatment process. Calculating EDSS takes a lot of time for Neurologists, so having a way to estimate EDSS can be helpful. This study aimed to estimate the EDSS score of MS pati...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1611.04561  شماره 

صفحات  -

تاریخ انتشار 2016